Single-Pass PCA of Large High-Dimensional Data

Authors

  • Wenjian Yu
  • Yu Gu
  • Jian Li
  • Shenghua Liu
  • Yaohang Li
Abstract

Principal component analysis (PCA) is a fundamental dimension reduction tool in statistics and machine learning. For large, high-dimensional data, computing the PCA (i.e., the top singular vectors of the data matrix) becomes challenging. In this work, a single-pass randomized algorithm is proposed that computes the PCA with only one pass over the data. It is suited to extremely large, high-dimensional data stored in slow memory (hard disk) or generated in a streaming fashion. Experiments with synthetic and real data validate the algorithm's accuracy: its error is orders of magnitude smaller than that of an existing single-pass algorithm. For a set of high-dimensional data stored as a 150 GB file, the algorithm computes the first 50 principal components in just 24 minutes on a typical 24-core computer, with less than 1 GB of memory.
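The one-pass idea can be illustrated with a minimal sketch. This is not the authors' algorithm, whose details are not given in the abstract. It swaps in a generic Nyström-type sketch of the Gram matrix A^T A, accumulated from row blocks that are each read once. The function name `single_pass_pca`, its parameters, and the rank threshold of 1e-10 are all assumptions made for illustration, and the data are assumed to be already centered.

```python
import numpy as np

def single_pass_pca(row_blocks, n_features, k, oversample=10, seed=0):
    """Illustrative one-pass PCA via a Nystrom sketch of A^T A.

    row_blocks: iterable of 2-D arrays, each holding some rows of A.
    Returns (components, singular_values): components has shape
    (n_features, k); both are approximations in general.
    """
    rng = np.random.default_rng(seed)
    l = k + oversample                      # sketch width (assumed choice)
    omega = rng.standard_normal((n_features, l))

    # Accumulate H = A^T (A @ omega); each block is touched exactly once.
    H = np.zeros((n_features, l))
    for block in row_blocks:
        block = np.asarray(block, dtype=float)
        H += block.T @ (block @ omega)

    # Nystrom approximation: A^T A ~= H @ pinv(omega^T H) @ H^T.
    W = omega.T @ H
    W = (W + W.T) / 2                       # remove round-off asymmetry
    w, V = np.linalg.eigh(W)
    keep = w > w.max() * 1e-10              # drop numerically zero directions
    F = H @ (V[:, keep] / np.sqrt(w[keep])) # so that A^T A ~= F @ F.T

    # Left singular vectors of F approximate the principal directions;
    # its singular values approximate those of A.
    U, S, _ = np.linalg.svd(F, full_matrices=False)
    return U[:, :k], S[:k]
```

Under these assumptions, memory use depends on the number of features and the sketch width but not on the number of rows, which is the property that makes a single pass over disk-resident or streamed data feasible. Accuracy on real data depends on how quickly the spectrum decays and on the oversampling amount.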

Similar resources

Identification of mineralization features and deep geochemical anomalies using a new FT-PCA approach

The analysis of geochemical data in the frequency domain, as indicated in this research study, can provide new exploratory information that may not be exposed in the spatial domain. To identify deep geochemical anomalies, the sulfide zone, and geochemical noise in the Dalli Cu–Au porphyry deposit, a new approach based on coupling the Fourier transform (FT) and principal component analysis (PCA) has been used. The re...

Streaming, Memory-Limited PCA

In this paper, we consider a streaming one-pass-over-the-data model for Principal Component Analysis (PCA). The input, in this case, is a stream of p-dimensional vectors, and the output is a collection of k, p-dimensional principal components that span the best approximating subspace. Consequently, the minimum memory requirement for such problems is O(kp). Yet the standard PCA algorithm require...

Face Detection at the Low Light Environments

Today, with the advancement of technology, tools for extracting information from video are used much more widely, in terms of both visual power and processing power. High-speed car, perfect detection accuracy, business diversity in the fields of medicine, home appliances, smart cars, humanoid robots, and military systems, and commercialization make these systems cost-effective. Among the most...

Linear Modelling for Spectral Images based on Truncated Fourier Series

Reflectance spectra of hyperspectral images of natural scenes are supposed to represent the real world better than any particular class of natural or man-made spectral reflectances. But spectral images contain a large volume of data and place considerable demands on computer hardware and software compared with standard trichromatic images. Although principal component analysis (PCA) based lo...

Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach

Feature selection can be decisive when analyzing high-dimensional data, especially with a small number of samples. Feature extraction methods do not perform well under these conditions. With small sample sets and high-dimensional data, exploring a large search space and learning from insufficient samples become extremely hard. As a result, neural networks and clustering a...

Publication date: 2017